首页> 外文OA文献 >Kolmogorov Complexity, Information Theory and Compression Based Classification of Biological Sequences and Structures: Theory versus Practice
【2h】

Kolmogorov Complexity, Information Theory and Compression Based Classification of Biological Sequences and Structures: Theory versus Practice

机译:Kolmogorov复杂性,信息论和基于压缩的分类 序列和结构研究:理论与实践

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

BACKGROUNDSimilarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment methods seem inadequate for post-genomic studies since they do not scale well with data set size and they seem to be confined only to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity and universality is its most novel striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness is tested on various data sets yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM and mostly at a qualitative level, no comparison among UCD, NCD and CD is available and no comparison of USM with existing methods, both based on alignments and not, seems to be available.RESULTSWe experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to Molecular Biology. This offers the first systematic and quantitative experimental assessment of this methodology, that naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and not. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Curve) analysis, aims at assessing the intrinsic ability of the methodology to discriminate and classify biological sequences and structures. A second set of experiments aims at assessing how well two commonly available classification algorithms, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NJ (Neighbor Joining), can use the methodology to perform their task, their performance being evaluated against gold standards and with the use of well known statistical indexes, i.e., the F-measure and the partition distance. Based on the experiments, several conclusions can be drawn and, from them, novel valuable guidelines for the use of USM on biological data. The main ones are reported next.CONCLUSIONUCD and NCD are indistinguishable, i.e., they yield nearly the same values of the statistical indexes we have used, accross experiments and data sets, while CD is almost always worse than both. UPGMA seems to yield better classification results with respect to NJ, i.e., better values of the statistical indexes (10% difference or above), on a substantial fraction of experiments, compressors and USM approximation choices. The compression program PPMd, based on PPM (Prediction by Partial Matching), for generic data and Gencompress for DNA, are the best performers among the compression algorithms we have used, although the difference in performance, as measured by statistical indexes, between them and the other algorithms depends critically on the data set and may not be as large as expected. PPMd used with UCD or NCD and UPGMA, on sequence data is very close, although worse, in performance with the alignment methods (less than 2% difference on the F-measure). Yet, it scales well with data set size and it can work on data other than sequences. In summary, our quantitative analysis naturally complements the rich theory behind USM and supports the conclusion that the methodology is worth using because of its robustness, flexibility, scalability, and competitiveness with existing techniques. In particular, the methodology applies to all biological data in textual format. The software and data sets are available under the GNU GPL at the supplementary material web page.
机译:背景技术序列的相似性是生物学分类和系统发育研究的关键数学概念。当前主要使用路线来处理。然而,比对方法似乎不足以用于基因组后研究,因为它们不能随数据集大小而很好地扩展,而且似乎仅局限于基因组和蛋白质组序列。因此,积极地追求无比对相似性度量。其中,USM(通用相似性度量标准)获得了显着的发展。它建立在柯尔莫哥洛夫(Kolmogorov)的深刻理论基础上,复杂性和普遍性是其最新颖的特征。由于只能通过数据压缩来近似,因此USM是一种方法,而不是用于量化两个字符串相似度的公式。提供了USM的三种近似值,分别是UCD(通用压缩不相似),NCD(归一化压缩不相似)和CD(压缩不相似)。在各种数据集上测试了它们的适用性和鲁棒性,得出了第一个大规模的定量估计,即USM方法及其近似值是有价值的。尽管围绕USM开发了丰富的理论,但其实验评估还是有局限性的:只有少数数据压缩器与USM一起进行了测试,并且大多在定性水平上进行,没有UCD,NCD和CD之间的比较,也没有USM与现有产品的比较结果我们似乎通过使用25种压缩器,其全部三个已知近似值和与分子生物学相关的六个数据集对USM方法进行了实验测试。这提供了这种方法的首次系统和定量实验评估,自然补充了许多可用的理论和初步实验结果。而且,我们将USM方法论与基于比对的方法和不基于比对的方法进行了比较。我们可以将实验分为两组。第一个通过ROC(接收器工作曲线)分析执行,旨在评估该方法对生物序列和结构进行区分和分类的内在能力。第二组实验旨在评估两种常用的分类算法UPGMA(带算术平均值的非加权对群方法)和NJ(邻居合并)如何使用该方法执行任务,并根据金标准和标准对它们的性能进行评估。使用众所周知的统计指标,即F度量和分区距离。基于实验,可以得出几个结论,并从中得出关于在生物学数据上使用USM的新颖有价值的指导原则。结论UCD和NCD难以区分,即它们在实验和数据集上产生的统计指标值几乎与我们使用的统计指标相同,而CD几乎总是比两者差。 UPGMA似乎在NJ方面产生了更好的分类结果,即在相当一部分实验,压缩器和USM近似选择中,统计指标的值更好(相差10%或更高)。在通用数据和基于DNA的Gencompress的基础上,基于PPM(部分匹配预测)的压缩程序PPMd是我们使用的压缩算法中性能最高的,尽管它们之间的性能差异(通过统计指标衡量)与其他算法主要取决于数据集,并且可能不如预期的大。与UCD或NCD和UPGMA一起使用的PPMd在序列数据上非常接近,但比对方法的性能更差(F度量的差异小于2%)。但是,它可以根据数据集大小很好地扩展,并且可以处理序列以外的数据。总而言之,我们的定量分析自然补充了USM背后的丰富理论,并支持以下结论:该方法具有鲁棒性,灵活性,可扩展性和与现有技术的竞争力,因此值得使用。特别是,该方法适用于文本格式的所有生物数据。该软件和数据集可在补充材料网页上的GNU GPL下获得。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
代理获取

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号